- Development History Of a File Compression Utility
- by Richard Greenlaw
- 251 Colony Ct.
- Gahanna, Ohio 43230
-
- May 18, 1981
- Revised June 6, 1981
-
-
- Introduction:
-
- The file compression system consists of two programs, SQ and
- USQ, meaning squeeze and unsqueeze. They are written in the
- C language for the BDS C compiler. The executable files are
- SQ.COM and USQ.COM, which are self-sufficient to run under
- the CP/M operating system and consist of 8080 machine code.
-
- SQ.COM is compiled from files SQ.C, SQDIO.C, TR1.C, TR2.C,
- IO.C, SQ.H, DIO.H and SQCOM.H. USQ.COM is compiled from
- files USQ.C, USQDIO.C, UTR.C, USQ.H and (again) DIO.H and
- SQCOM.H. Both COM files also include standard library
- functions and the BDS-C run-time package.
-
- SQDIO.C and USQDIO.C provide i/o redirection and pipe
- simulation. They are just the BDS dio package, renamed to
- produce distinct CRL files corresponding to the different
- addresses of the external variables with which they are
- compiled.
-
- The SQ program builds a squeezed file by applying two
- transformations:
-
- First, byte values which are repeated consecutively three or
- more times are reduced to the value, the token DLE
- (delimiter), and a count. The penalty is that occurrences of
- DLE are encoded as DLE, zero.
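-
- As an illustration only (not the actual TR1.C code), the
- first transformation might be coded as below. The DLE value
- of 90 hex and the 255 ceiling on the run count are
- assumptions of this sketch:
-
-     /* Sketch of the repeated-value encoder: a run of three
-      * or more identical bytes becomes  value, DLE, count;
-      * a literal DLE is escaped as  DLE, 0. */
-     #include <stdio.h>
-
-     #define DLE 0x90
-
-     void rle_encode(FILE *in, FILE *out)
-     {
-         int c, next, count;
-
-         while ((c = getc(in)) != EOF) {
-             if (c == DLE) {            /* literal DLE: escape it */
-                 putc(DLE, out);
-                 putc(0, out);
-                 continue;
-             }
-             count = 1;                 /* measure the run of c */
-             while ((next = getc(in)) == c && count < 255)
-                 count++;
-             if (count >= 3) {          /* worth compressing */
-                 putc(c, out);
-                 putc(DLE, out);
-                 putc(count, out);
-             } else {                   /* too short: copy as-is */
-                 while (count--)
-                     putc(c, out);
-             }
-             if (next != EOF)           /* push back the byte that */
-                 ungetc(next, in);      /* ended the run */
-         }
-     }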
-
- Second, the Huffman algorithm encodes each resulting byte
- value or endfile as a bit string whose length is inversely
- related to its frequency of occurrence. This is a
- complex process requiring reading the input file twice.
-
- The squeezed file contains various information to allow the
- USQ program to decode it and recreate the original file
- exactly.
-
- The environment:
-
- The programs should be nearly portable. The CP/M system
- actually used is a single-user 2MHz 8080 without interrupts.
-
- The BDS C compiler supports a subset of C. It does not
- support register variables, long integers or floats. That
- leads to complexity in collecting and processing the
- frequencies of occurrence of the various byte values being
- encoded.
-
- Outline of SQ
-
- The interesting work begins in function squeeze in file
- SQ.C. In the first pass, init_huff in file TR2.C reads the
- input through the first encoder, getcnr, in file TR1.C,
- collects the frequency distribution and builds the decoding
- and encoding structures. Then wrt_head in file TR2.C writes
- control information and the decoding structure to the output
- file.
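-
- The frequency-collection step of that first pass can be
- pictured with the self-contained sketch below. It is not the
- init_huff code itself; the array name and size are
- illustrative (one slot per byte value plus one for the
- endfile symbol), and in SQ the bytes being counted are those
- already passed through the first encoder:
-
-     #include <stdio.h>
-
-     #define NUMVALS 257          /* 256 byte values + endfile */
-
-     unsigned freq[NUMVALS];      /* frequency of each symbol */
-
-     void count_pass(FILE *in)
-     {
-         int c;
-
-         while ((c = getc(in)) != EOF)
-             freq[c]++;           /* tally every input byte */
-         freq[NUMVALS - 1]++;     /* endfile is encoded too */
-     }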
-
- In the second pass, encoded bytes consisting of bits and
- pieces of bit strings are generated by function gethuff in
- file TR2.C and are simply written to the output by squeeze.
-
- Development History of SQ
-
- There have been seven operational pre-release versions of
- SQ. The motive for change in each case was primarily
- increased execution speed, although the conveniences of
- operating on lists of files, automatic name generation for
- squeezed files, and output drive specifiers were also added
- in the later versions.
-
- Early versions called the following chain of functions to
- get a byte of encoded data: gethuff, getbit, get_cnr,
- getc_crc and getc. They wrote output through putce and
- putc. That's a lot of function calling. In addition, gethuff
- and getbit were passed pointers to functions to identify
- get_cnr.
-
- Actually, those versions used a dummy function for get_cnr,
- the repeated value encoder, although the actual code was
- present. This was to simplify debugging and because USQ did
- not yet have the inverse of that translation.
-
- The benchmark for comparisons was not consistent because
- files were lost at two points. In effect, the current SQ.COM
- squeezed itself! It typically achieved 6-7% compression on
- a machine code file of 8K to 10K bytes. Of course machine
- code is not a practical case, but it is a rugged workout.
- Text files are compressed 33% to 46% depending on the
- richness and distribution of the alphabet.
-
- V0, for which listings have not survived, took 5:10 (five
- minutes, 10 seconds) to squeeze itself! This was improved to
- 4:23 by the optimizer option of the compiler, which simply
- generates in-line code rather than subroutine calls for all
- local and external variable accesses. It was further
- improved to 4:18 (and restored to its original length) by
- the -e compiler option which specifies the origin of the
- external variable area to allow direct addressing. (The BDS
- linker resolves only function names - externals are actually
- like FORTRAN COMMON and are normally accessed relative to a
- pointer kept in RAM!).
-
- Subsequent improvements came mostly from recoding the key
- routine. Copies of gethuff and its partner getbit are
- attached for versions V1 through V6 and the complete
- listings (20 pages) for V6 are included.
-
- In V0 through V2, gethuff forms an output byte by calling
- getbit eight times and packing the bits. This is the obvious
- method because the Huffman translation produces variable
- length bit strings, not a byte for a byte.
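-
- In outline, that bit-at-a-time method looks like the sketch
- below. This is not the V0-V2 source; getbit here is a
- stand-in serving bits from a small test buffer, and the bit
- order within a byte is chosen arbitrarily for the
- illustration:
-
-     #include <stdio.h>
-
-     static unsigned char src[] = { 0xC5, 0x3A };  /* toy bit stream */
-     static int bitpos = 0;
-
-     int getbit(void)             /* next bit, or EOF when done */
-     {
-         int bit;
-
-         if (bitpos >= 8 * (int)sizeof src)
-             return EOF;
-         bit = (src[bitpos / 8] >> (7 - bitpos % 8)) & 1;
-         bitpos++;
-         return bit;
-     }
-
-     int gethuff_slow(void)       /* pack eight bits per output byte */
-     {
-         int i, bit, byte = 0;
-
-         for (i = 0; i < 8; i++) {
-             if ((bit = getbit()) == EOF)
-                 return EOF;   /* a real encoder must still flush
-                                  a partial last byte */
-             byte = (byte << 1) | bit;
-         }
-         return byte;
-     }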
-
- V1 introduced the variable codebyte to the getbit function.
- It was rotated each time a bit was removed, so that
- subsequent calls had to shift it only one bit position. This
- involved considerable change. Timing is uncertain now.
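-
- The idea can be illustrated as follows. Only the name
- codebyte comes from the text; the counter name and bit order
- are invented, and in SQ the bits actually come from the code
- string of the current input character rather than straight
- from a file as in this sketch:
-
-     #include <stdio.h>
-
-     static int codebyte;         /* byte currently being consumed */
-     static int bits_left = 0;    /* bits of it not yet delivered */
-
-     int getbit_v1(FILE *in)
-     {
-         int bit;
-
-         if (bits_left == 0) {    /* refill from the source */
-             if ((codebyte = getc(in)) == EOF)
-                 return EOF;
-             bits_left = 8;
-         }
-         bit = codebyte & 1;      /* take the next bit ...           */
-         codebyte >>= 1;          /* ... and shift once, leaving the */
-         bits_left--;             /* byte ready for the next call    */
-         return bit;
-     }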
-
- V2 continued to improve the getbit function by customizing
- the three basic cases and providing separate returns from
- each to avoid unnecessary work.
-
- The changes of V1 and V2 reduced run time to 1:41, a
- whopping 61% reduction!
-
- V3 incorporated getbit into gethuff. This wasn't difficult
- because getbit was called only once by gethuff. It ran in
- 1:27 (on a slightly smaller file), another 14% reduction.
-
- V4 removed the pointers to functions mentioned earlier and
- substituted direct calls. However, at this point the real
- translation for repeated values was enabled. The net result
- was a slight loss of ground to 1:30, but more productive work.
-
- V0 through V4 worked from Huffman code bit strings of
- indefinite length accessed through an array of pointers.
- Each string was byte aligned (unlike the final encoded
- data).
-
- V5 was a complete redesign of the storage and retrieval of
- the array of code strings. I had finally succeeded in
- proving that the maximum length code string would fit in the
- same space as the sum of all frequency counts, so scaling in
- init_huff was made more rigorous to fit them into unsigned
- integers (16 bits).
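-
- One common way to do such scaling (offered only as a generic
- sketch, not as the method actually used in init_huff) is to
- halve every count larger than one until the grand total fits
- the chosen ceiling:
-
-     #define NUMVALS  257         /* 256 byte values + endfile */
-     #define MAXCOUNT 0xFFFFu     /* ceiling for a 16-bit total */
-
-     void scale_counts(unsigned freq[NUMVALS])
-     {
-         unsigned long total;
-         int i;
-
-         for (;;) {
-             total = 0;
-             for (i = 0; i < NUMVALS; i++)
-                 total += freq[i];
-             if (total <= MAXCOUNT)
-                 return;          /* counts now sum within 16 bits */
-             for (i = 0; i < NUMVALS; i++)
-                 if (freq[i] > 1)
-                     freq[i] = (freq[i] + 1) / 2;  /* halve, not to 0 */
-         }
-     }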
-
- This redesign paved the way for a relatively simple method
- of processing the code strings several bits at a time,
- rather than singly in an eight pass loop to form an output
- byte.
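-
- The flavor of that method is sketched below. The variable
- cbitsrem is mentioned a little further on in this history;
- curcode and the other names are invented, and the bit order
- is arbitrary. The point is that a whole chunk of the current
- code string is merged into the output byte with a single
- shift, instead of one bit per pass:
-
-     static unsigned curcode;     /* unconsumed bits of current code */
-     static int cbitsrem;         /* how many of those bits remain */
-     static unsigned outacc;      /* output byte being assembled */
-     static int outbits;          /* bits already in outacc */
-
-     /* Merge bits of the current code into outacc; return a
-      * finished byte, or -1 when the caller must first load the
-      * next code into curcode/cbitsrem. */
-     int pack_some_bits(void)
-     {
-         int need = 8 - outbits;
-         int take = (cbitsrem < need) ? cbitsrem : need;
-
-         outacc |= (curcode & ((1u << take) - 1)) << outbits;
-         curcode >>= take;
-         cbitsrem -= take;
-         outbits += take;
-
-         if (outbits == 8) {      /* a complete output byte is ready */
-             int byte = outacc & 0xFF;
-             outacc = 0;
-             outbits = 0;
-             return byte;
-         }
-         return -1;               /* need the next code string */
-     }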
-
- At this point the fancy file name processing, etc., were
- added, increasing the size of SQ.COM from 7680 bytes to
- 10,112 bytes, an increase of 32% in the work performed by
- the "benchmark". V5 ran in 1:40, which scales to 1:16, a
- reduction of 16%. In a second variant, changing the variable
- cbitsrem from an integer to a char saved another 5%.
-
- V6 restructures the gethuff of V5, replacing the while loop
- with a custom (goto) loop in which the exit condition is
- tested only in a special case. The two basic cases also do
- only the work necessary to their own cases. Also, squeeze in
- SQ.C calls putc directly and does its own check for write
- failure, saving one layer of function calls. It ran in 1:29
- (scales to 1:08), a reduction of 6% from the second variant
- of V5.
-
- The overall performance improvement ratio, scaled for the
- one major change in the benchmark workload (but not taking
- credit for the enabling of the repeated character encoding)
- was about 4.5 : 1, or a reduction of 78%. The true
- improvement was probably a factor of 5.